This notebook demonstrates how to use ML Workbench to create a machine learning model for text classification and set it up for online prediction. The model is trained "locally" inside Datalab. The next notebook (Text Classification --- 20NewsGroup (large data)) demonstrates how to do the same using Cloud ML Engine services.
If you have any feedback, please send it to datalab-feedback@google.com.
The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics. The classification problem is to identify the newsgroup a post was submitted to, given the text of the post.
There are a few versions of this dataset from different sources online. Below, we use the version within scikit-learn, which is already split into a train and test/eval set. For a longer introduction to this dataset, see the scikit-learn website.
In [59]:
import numpy as np
import pandas as pd
import os
import re
import csv
from sklearn.datasets import fetch_20newsgroups
In [60]:
# The data will be downloaded. Note that a warning message like "No handlers could be found for
# logger sklearn.datasets.twenty_newsgroups" might be printed, but this is not an error.
news_train_data = fetch_20newsgroups(subset='train', shuffle=True, random_state=42, remove=('headers', 'footers', 'quotes'))
news_test_data = fetch_20newsgroups(subset='test', shuffle=True, random_state=42, remove=('headers', 'footers', 'quotes'))
In [61]:
news_train_data.data[2], news_train_data.target_names[news_train_data.target[2]]
Out[61]:
In [62]:
def clean_and_tokenize_text(news_data):
    """Cleans some issues with the text data.
    Args:
        news_data: list of text strings
    Returns:
        For each text string, a list of tokenized words.
    """
    cleaned_text = []
    for text in news_data:
        x = re.sub(r'[^\w]|_', ' ', text)  # keep only letters, numbers, and spaces
        x = x.lower()
        x = re.sub(r'[^\x00-\x7f]', r'', x)  # remove non-ascii characters
        tokens = [y for y in x.split(' ') if y]  # remove empty words
        tokens = ['[number]' if t.isdigit() else t for t in tokens]  # convert all numbers to '[number]' to reduce vocab size
        cleaned_text.append(tokens)
    return cleaned_text
In [63]:
clean_train_tokens = clean_and_tokenize_text(news_train_data.data)
clean_test_tokens = clean_and_tokenize_text(news_test_data.data)
In [64]:
def get_unique_tokens_per_row(text_token_list):
    """Collects unique tokens per row.
    Args:
        text_token_list: list, where each element is a list containing tokenized text
    Returns:
        One list containing the unique tokens of every row. For example, if row one contained
        ['pizza', 'pizza'] while row two contained ['pizza', 'cake', 'cake'], then the output list
        would contain ['pizza' (from row 1), 'pizza' (from row 2), 'cake' (from row 2)].
    """
    words = []
    for row in text_token_list:
        words.extend(set(row))
    return words
In [65]:
# Make a plot where the x-axis is a token, and the y-axis is how many text documents
# that token is in.
words = pd.DataFrame(get_unique_tokens_per_row(clean_train_tokens), columns=['words'])
token_frequency = words['words'].value_counts() # how many documents contain each token.
token_frequency.plot(logy=True)
Out[65]:
In [66]:
# Keep tokens that appear in more than 10 but fewer than 1000 documents.
vocab = token_frequency[np.logical_and(token_frequency < 1000, token_frequency > 10)]
vocab.plot(logy=True)
Out[66]:
In [67]:
def filter_text_by_vocab(news_data, vocab):
    """Removes tokens that are not in the vocab.
    Args:
        news_data: list, where each element is a token list
        vocab: set containing the tokens to keep
    Returns:
        List of strings containing the final cleaned text data
    """
    text_strs = []
    for row in news_data:
        words_to_keep = [token for token in row if token in vocab or token == '[number]']
        text_strs.append(' '.join(words_to_keep))
    return text_strs
In [68]:
clean_train_data = filter_text_by_vocab(clean_train_tokens, set(vocab.index))
clean_test_data = filter_text_by_vocab(clean_test_tokens, set(vocab.index))
In [69]:
# Check a few instances of cleaned data
clean_train_data[:3]
Out[69]:
In [70]:
!mkdir -p ./data
with open('./data/train.csv', 'w') as f:
    writer = csv.writer(f, lineterminator='\n')
    for target, text in zip(news_train_data.target, clean_train_data):
        writer.writerow([news_train_data.target_names[target], text])
with open('./data/eval.csv', 'w') as f:
    writer = csv.writer(f, lineterminator='\n')
    for target, text in zip(news_test_data.target, clean_test_data):
        writer.writerow([news_test_data.target_names[target], text])
# Also save the vocab, which will be useful for making new predictions.
with open('./data/vocab.txt', 'w') as f:
    vocab.to_csv(f)
The MLWorkbench Magics are a set of Datalab commands that provide an easy, nearly code-free experience for training, deploying, and predicting with ML models. This notebook takes the cleaned data from above and builds a text classification model. The MLWorkbench Magics are a collection of magic commands, one for each step of the ML workflow: analyzing input data to build transforms, transforming data, training a model, evaluating a model, and deploying a model.
For details of each command, run it with --help. For example, "%%ml train --help".
When the dataset is small (like the 20 newsgroups data), there is little benefit to using cloud services. This notebook runs the analyze, transform, and training steps locally. However, we will take the locally trained model, deploy it to ML Engine, and show how to make real predictions against the deployed model. Every MLWorkbench magic can run locally or use cloud services (by adding the --cloud flag).
The next notebook (Text Classification --- 20NewsGroup (large data)) in this sequence shows the cloud version of every command and reflects the typical experience of building models on large datasets, although it still uses the 20 newsgroups data.
In [71]:
import google.datalab.contrib.mlworkbench.commands # This loads the '%%ml' magics
First, define the dataset we are going to use for training.
In [72]:
%%ml dataset create
name: newsgroup_data
format: csv
train: ./data/train.csv
eval: ./data/eval.csv
schema:
  - name: news_label
    type: STRING
  - name: text
    type: STRING
In [73]:
%%ml dataset explore
name: newsgroup_data
The first step in the MLWorkbench workflow is to analyze the data for the requested transformations. We are going to build a bag-of-words representation of the text and use it in a linear model, so the analyze step computes the vocabulary and related statistics of the training data.
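For intuition, a bag of words simply counts how often each vocabulary token occurs in a document. Below is a minimal sketch of the idea using scikit-learn's CountVectorizer; this is only an illustration of the representation, not what "%%ml analyze" runs internally.
In [ ]:
# Illustration only: a bag-of-words encoding turns each document into a vector
# of token counts. %%ml analyze builds its own vocabulary and statistics, so
# treat this CountVectorizer sketch as a conceptual analogy.
from sklearn.feature_extraction.text import CountVectorizer

docs = ['nasa launched a rocket', 'windows crashed again', 'nasa windows patch']
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse document-term matrix
print(vectorizer.get_feature_names())    # the learned vocabulary
print(counts.toarray())                  # one row of token counts per document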
In [74]:
%%ml analyze
output: ./analysis
data: newsgroup_data
features:
  news_label:
    transform: target
  text:
    transform: bag_of_words
In [75]:
!ls ./analysis
This step is optional, as training can start from csv data (the same data used in the analysis step). The transform step applies the transformations to the input data and saves the results in a special TensorFlow file called a TFRecord file, which contains tf.Example protocol buffers. This allows training to start from preprocessed data. If this step is skipped, training has to perform the same preprocessing on every row of csv data each time it reads it. Because TensorFlow reads the same data rows multiple times during training, the same row would be preprocessed multiple times. Writing the preprocessed data to disk therefore speeds up training. Since the 20 newsgroups data is small, this step does not matter much, but we do it anyway for illustration. The transform step is recommended if a dataset has text columns, and required if it has image columns.
We run the transform step for both the training and eval data.
In [76]:
!rm -rf ./transform
In [77]:
%%ml transform --shuffle
output: ./transform
analysis: ./analysis
data: newsgroup_data
In [78]:
# Note: the errors_* files all have size 0, which means there were no errors.
!ls ./transform/ -l -h
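If you are curious about what the transform step wrote, each record is a tf.Example protocol buffer inside a TFRecord file. Below is a minimal sketch for decoding one record with the TF 1.x API; the GZIP compression setting is an assumption about how these files are written, so drop the options argument if they turn out to be uncompressed.
In [ ]:
# Sanity check: decode the first tf.Example from the transform output.
# Assumption: the output files are gzipped TFRecords matching ./transform/train-*;
# if they are uncompressed, drop the options argument.
import glob
import tensorflow as tf

path = glob.glob('./transform/train-*')[0]
options = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.GZIP)
for record in tf.python_io.tf_record_iterator(path, options=options):
    print(tf.train.Example.FromString(record))  # the transformed features of one row
    break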
Create a "transformed dataset" to use in the next step.
In [79]:
%%ml dataset create
name: newsgroup_transformed
train: ./transform/train-*
eval: ./transform/eval-*
format: transformed
In [80]:
# Training should use an empty output folder. So if you run training multiple times,
# use different folders or remove the output from the previous run.
!rm -fr ./train
The following training step takes about 10 to 15 minutes.
In [81]:
%%ml train
output: ./train
analysis: ./analysis/
data: newsgroup_transformed
model_args:
  model: linear_classification
  top-n: 5
Go to TensorBoard (link shown above) to monitor training progress. Note that training stops when it detects that accuracy on the eval data is no longer increasing.
In [82]:
# You can also plot the summary events which will be saved with the notebook.
from google.datalab.ml import Summary
summary = Summary('./train')
summary.list_events()
Out[82]:
In [83]:
summary.plot(['loss', 'accuracy'])
The output of training is two models, one in ./train/model and another in ./train/evaluation_model. These TensorFlow models are identical except that the latter assumes the target column is part of the input and copies the target value to the output. The latter is therefore ideal for evaluation.
In [84]:
!ls ./train/
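To see how the two exports differ, you can inspect their serving signatures. Below is a minimal sketch using the TF 1.x SavedModel loader; it assumes both directories are standard SavedModels exported with the 'serve' tag and a 'serving_default' signature.
In [ ]:
# Compare the serving signatures of the two exported models.
# Assumption: both directories are TF 1.x SavedModels with the 'serve' tag
# and a 'serving_default' signature.
import tensorflow as tf

for model_dir in ['./train/model', './train/evaluation_model']:
    with tf.Session(graph=tf.Graph()) as sess:
        meta = tf.saved_model.loader.load(sess, ['serve'], model_dir)
        signature = meta.signature_def['serving_default']
        print(model_dir)
        print('  inputs: ', list(signature.inputs.keys()))
        print('  outputs:', list(signature.outputs.keys()))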
In [85]:
%%ml batch_predict
model: ./train/evaluation_model/
output: ./batch_predict
format: csv
data:
  csv: ./data/eval.csv
In [86]:
# It creates a results csv file, and a results schema json file.
!ls ./batch_predict
Note that the output of prediction is a csv file containing the score for each label class. 'predicted_n' is the label with the nth largest score. We care most about 'predicted', the final model prediction.
In [87]:
!head -n 5 ./batch_predict/predict_results_eval.csv
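The same file can also be sliced with pandas. As a quick cross-check of the evaluate commands below, here is a manual accuracy computation; it assumes the header row shown above contains 'predicted' and 'target' columns.
In [ ]:
# Quick manual check of the batch prediction output with pandas.
# Assumption: the results csv has a header row with 'predicted' and 'target'
# columns, as the head output above suggests.
results = pd.read_csv('./batch_predict/predict_results_eval.csv')
print('accuracy: %.4f' % (results['predicted'] == results['target']).mean())
results[results['predicted'] != results['target']].head()  # a few wrong predictions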
In [88]:
%%ml evaluate confusion_matrix --plot
csv: ./batch_predict/predict_results_eval.csv
In [89]:
%%ml evaluate accuracy
csv: ./batch_predict/predict_results_eval.csv
Out[89]:
In [90]:
# Create bucket
!gsutil mb gs://bq-mlworkbench-20news-lab
!gsutil cp -r ./batch_predict/predict_results_eval.csv gs://bq-mlworkbench-20news-lab
In [91]:
# Use Datalab's BigQuery API to load the CSV file into a table.
import google.datalab.bigquery as bq
import json
with open('./batch_predict/predict_results_schema.json', 'r') as f:
    schema = json.load(f)
# Create the BQ dataset.
bq.Dataset('newspredict').create()
# Create the table and load the results.
table = bq.Table('newspredict.result1').create(schema=schema, overwrite=True)
table.load('gs://bq-mlworkbench-20news-lab/predict_results_eval.csv', mode='overwrite',
           source_format='csv', csv_options=bq.CSVOptions(skip_leading_rows=1))
Out[91]:
Now you can run any SQL query against the table newspredict.result1. Below we query all wrong predictions.
In [92]:
%%bq query
SELECT * FROM newspredict.result1 WHERE predicted != target
Out[92]:
In [93]:
%%ml predict
model: ./train/model/
headers: text
data:
  - nasa
  - windows xp
"%%ml explain" gives you insights on what are important features in the prediction data that contribute positively or negatively to certain labels. We use LIME under "%%ml explain". (LIME is an open sourced library performing feature sensitivity analysis. It is based on the work presented in this paper. LIME is included in Datalab.)
In this case, we will check which words in text are contributing most to the predicted label.
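For reference, the sketch below shows roughly how LIME's text explainer is invoked directly. The predict_fn here is a hypothetical placeholder for a function mapping a list of raw strings to an array of class probabilities; "%%ml explain" wires the real model up for you.
In [ ]:
# Conceptual sketch of what %%ml explain does with LIME under the hood.
# 'predict_fn' is a hypothetical placeholder: any callable that takes a list
# of strings and returns an (n_samples, n_classes) array of probabilities.
from lime.lime_text import LimeTextExplainer

def predict_fn(texts):
    return np.tile([0.9, 0.1], (len(texts), 1))  # placeholder probabilities

explainer = LimeTextExplainer(class_names=['rec.autos', 'comp.windows.x'])
explanation = explainer.explain_instance('nasa launched the shuttle', predict_fn, num_features=5)
print(explanation.as_list())  # (word, weight) pairs for the explained label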
In [94]:
# Pick some instances from the eval csv file. They are already cleaned text.
# The truth labels for the following 3 instances are
# - rec.autos
# - comp.windows.x
# - talk.politics.mideast
instance0 = ('little confused models [number] [number] heard le se someone tell differences far features ' +
'performance curious book value [number] model less book value usually words demand ' +
'year heard mid spring early summer best buy')
instance1 = ('hi requirement closing opening different display servers within x application manner display ' +
'associated client proper done during transition problems')
instance2 = ('attacking drive kuwait country whose citizens close blood business ties saudi citizens thinks ' +
'helped saudi arabia least eastern muslim country doing anything help kuwait protect saudi arabia ' +
'indeed masses citizens demonstrating favor butcher saddam killed muslims killing relatively rich ' +
'muslims nose west saudi arabia rolled iraqi invasion charge saudi arabia idea governments official ' +
'religion de facto de human nature always ones rise power world country citizens leader slick ' +
'operator sound guys angels posting edited stuff following friday york times reported group definitely ' +
'conservative followers house rule country enough reported besides complaining government conservative ' +
'enough asserted approx [number] [number] kingdom charge under saudi islamic law brings death penalty ' +
'diplomatic guy bin isn called severe punishment [number] women drove public while protest ban women ' +
'driving guy group said al said women fired jobs happen heard muslims ban women driving basis qur etc ' +
'yet folks ban women called choose rally behind hate women allowed tv radio immoral kingdom house neither ' +
'least nor favorite government earth restrict religious political lot among things likely replacements ' +
'going lot worse citizens country house feeling heat lately last six months read religious police ' +
'government western women fully stupid women imo sends wrong signals morality read cracked down few home ' +
'based religious posted government owned newspapers offering money turns group dare worship homes secret ' +
'place government grown try take wind conservative opposition things small taste happen guys house trying ' +
'long run others general west evil zionists rule hate west crowd')
data = [instance0, instance1, instance2]
In [95]:
%%ml predict
model: ./train/model/
headers: text
data: $data
The first and second instances are predicted correctly. The third is wrong. Below we run "%%ml explain" to understand more.
In [96]:
%%ml explain --detailview_only
model: ./train/model
labels: rec.autos
type: text
data: $instance0
In [97]:
%%ml explain --detailview_only
model: ./train/model
labels: comp.windows.x
type: text
data: $instance1
On instance 2, the top prediction does not match the truth: the predicted label is "talk.politics.guns" while the truth is "talk.politics.mideast". So let's analyze these two labels.
In [98]:
%%ml explain --detailview_only
model: ./train/model
labels: talk.politics.guns,talk.politics.mideast
type: text
data: $instance2
Now that we have a trained model, analyzed the results, and tested the model locally, we are ready to deploy it to the cloud for real predictions.
Deploying a model requires the files to be on GCS. The next few cells make a bucket on GCS, copy the locally trained model to it, and deploy it.
In [99]:
!gsutil -q mb gs://bq-mlworkbench-20news-lab
In [100]:
# Move the regular model to GCS
!gsutil -m cp -r ./train/model gs://bq-mlworkbench-20news-lab
See https://cloud.google.com/ml-engine/docs/how-tos/managing-models-jobs for the definition of ML Engine models and versions. An ML Engine version runs predictions and is contained in an ML Engine model. We will create a new ML Engine model and deploy the TensorFlow graph as an ML Engine version. This can be done with gcloud (see https://cloud.google.com/ml-engine/docs/how-tos/deploying-models) or with Datalab, which we use below.
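For reference, the gcloud equivalent of the deploy command below looks roughly like this; the region, runtime version, and exact model directory are assumptions you would adjust for your project, and you should run either this or the magic below, not both.
In [ ]:
# Rough gcloud equivalent of '%%ml model deploy' below (do not run both).
# The region, runtime version, and model directory are assumptions to adjust.
!gcloud ml-engine models create news --regions us-central1
!gcloud ml-engine versions create alpha --model news \
    --origin gs://bq-mlworkbench-20news-lab/model --runtime-version 1.2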
In [101]:
%%ml model deploy
path: gs://bq-mlworkbench-20news-lab
name: news.alpha
A common task is to call a deployed model from different applications. Below is an example of a Python client that runs prediction.
Covering model permissions is outside the scope of this notebook, but for more information see https://cloud.google.com/ml-engine/docs/tutorials/python-guide and https://developers.google.com/identity/protocols/application-default-credentials.
In [102]:
from oauth2client.client import GoogleCredentials
from googleapiclient import discovery
from googleapiclient import errors
# Store your project ID, model name, and version name in the format the API needs.
api_path = 'projects/{your_project_ID}/models/{model_name}/versions/{version_name}'.format(
    your_project_ID=google.datalab.Context.default().project_id,
    model_name='news',
    version_name='alpha')
# Get application default credentials (possible only if the gcloud tool is
# configured on your machine). See https://developers.google.com/identity/protocols/application-default-credentials
# for more info.
credentials = GoogleCredentials.get_application_default()
# Build a representation of the Cloud ML API.
ml = discovery.build('ml', 'v1', credentials=credentials)
# Create a dictionary containing the data to predict.
# Note that the data is a list of csv strings.
body = {'instances': ['nasa', 'windows xp']}
# Create the request.
request = ml.projects().predict(name=api_path, body=body)
print('The JSON request: \n')
print(request.to_json())
# Make the call.
try:
    response = request.execute()
    print('\nThe response:\n')
    print(json.dumps(response, indent=2))
except errors.HttpError as err:
    # Something went wrong; print out some information.
    print('There was an error. Check the details:')
    print(err._get_reason())
To explore the prediction client further, check the API Explorer (https://developers.google.com/apis-explorer). It allows you to send raw HTTP requests to many Google APIs. This is useful for understanding the requests and responses, and it can help you build your own client in your favorite language.
Please visit https://developers.google.com/apis-explorer/#search/ml%20engine/ml/v1/ml.projects.predict and enter the following values in each text box.
In [103]:
# The output of this cell is placed in the name box.
# Store your project ID, model name, and version name in the format the API needs.
api_path = 'projects/{your_project_ID}/models/{model_name}/versions/{version_name}'.format(
    your_project_ID=google.datalab.Context.default().project_id,
    model_name='news',
    version_name='alpha')
print('Place the following in the name box')
print(api_path)
The fields text box can be empty.
Note that because we deployed the non-evaluation model, the deployed model takes csv input with only one column. In general, "instances" is a list of csv strings for models trained by MLWorkbench.
Click in the request body box and note that a small drop-down menu appears at the far right of the input box. Select "Freeform editor". Then enter the following in the request body box.
In [104]:
print('Place the following in the request body box')
request = {'instances': ['nasa', 'windows xp']}
print(json.dumps(request))
Then click the "Authorize and execute" button. The prediction results are returned in the browser.
In [105]:
%%ml model delete
name: news.alpha
In [ ]:
%%ml model delete
name: news
In [107]:
# Delete the GCS bucket
!gsutil -m rm -r gs://bq-mlworkbench-20news-lab
In [108]:
# Delete BQ table
bq.Dataset('newspredict').delete(delete_contents = True)